First, we read in the data and prepare it for analysis. The data are mostly clean, but four steps remain: we need a subset for calculating correlations, some variables must be converted to categorical, others must be converted to numeric, and the dates must be parsed so that they are not read in as character strings.
Before any cleaning, the dataset has the following structure:
# read in dataset
airbnb <- read.csv("NYC ABS.csv", header = TRUE) # read.csv() already returns a data.frame
#structure of dataset
str(airbnb)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "10/19/2018" "5/21/2019" "" "7/5/2019" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
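The cleaning code itself is not shown in this report. A minimal sketch of the conversions described above, assuming base R only and using a three-row stand-in in place of the real file (`airbnb_demo` and `airbnb_cor_demo` are hypothetical names, and only a few of the 16 columns are included), might look like this:

```r
# Toy stand-in for the raw data; values copied from the str() output above.
airbnb_demo <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan"),
  room_type           = c("Private room", "Entire home/apt", "Private room"),
  price               = c(149L, 225L, 150L),
  number_of_reviews   = c(9L, 45L, 0L),
  last_review         = c("10/19/2018", "5/21/2019", ""),
  reviews_per_month   = c(0.21, 0.38, NA),
  stringsAsFactors    = FALSE
)

# Character columns that are really categories become factors.
cat_cols <- c("neighbourhood_group", "room_type")
airbnb_demo[cat_cols] <- lapply(airbnb_demo[cat_cols], factor)

# Integer columns become numeric (double).
num_cols <- c("price", "number_of_reviews")
airbnb_demo[num_cols] <- lapply(airbnb_demo[num_cols], as.numeric)

# Dates arrive as month/day/year strings; empty strings become NA.
airbnb_demo$last_review <- as.Date(airbnb_demo$last_review, format = "%m/%d/%Y")

# Numeric-only subset for the correlation analysis.
airbnb_cor_demo <- airbnb_demo[, c("price", "number_of_reviews", "reviews_per_month")]

str(airbnb_demo)
```

Passing an explicit `format` to `as.Date()` matters here: the raw strings are month/day/year, which is not one of the formats R tries by default.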
After cleaning, our main dataset is described below:
str(airbnb)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ neighbourhood : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ price : num 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : num 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : Date, format: "2018-10-19" "2019-05-21" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: num 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num 365 355 365 194 0 129 0 220 0 188 ...
Our secondary dataset (used to measure correlation) is described below:
str(airbnb_cor)
## 'data.frame': 48895 obs. of 6 variables:
## $ price : num 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : num 9 45 0 270 9 74 49 430 118 160 ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: num 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num 365 355 365 194 0 129 0 220 0 188 ...
Our dataset has 48895 observations.
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Min. : 2539 | Length:48895 | Min. :2.44e+03 | Length:48895 | Bronx : 1091 | Williamsburg : 3920 | Min. :40.5 | Min. :-74.2 | Entire home/apt:25409 | Min. : 0 | Min. : 1 | Min. : 0 | Min. :2011-03-28 | Min. : 0 | Min. : 1 | Min. : 0 |
| Q1 | 1st Qu.: 9471945 | Class :character | 1st Qu.:7.82e+06 | Class :character | Brooklyn :20104 | Bedford-Stuyvesant: 3714 | 1st Qu.:40.7 | 1st Qu.:-74.0 | Private room :22326 | 1st Qu.: 69 | 1st Qu.: 1 | 1st Qu.: 1 | 1st Qu.:2018-07-08 | 1st Qu.: 0 | 1st Qu.: 1 | 1st Qu.: 0 |
| Median | Median :19677284 | Mode :character | Median :3.08e+07 | Mode :character | Manhattan :21661 | Harlem : 2658 | Median :40.7 | Median :-74.0 | Shared room : 1160 | Median : 106 | Median : 3 | Median : 5 | Median :2019-05-19 | Median : 1 | Median : 1 | Median : 45 |
| Mean | Mean :19017143 | NA | Mean :6.76e+07 | NA | Queens : 5666 | Bushwick : 2465 | Mean :40.7 | Mean :-74.0 | NA | Mean : 153 | Mean : 7 | Mean : 23 | Mean :2018-10-04 | Mean : 1 | Mean : 7 | Mean :113 |
| Q3 | 3rd Qu.:29152178 | NA | 3rd Qu.:1.07e+08 | NA | Staten Island: 373 | Upper West Side : 1971 | 3rd Qu.:40.8 | 3rd Qu.:-73.9 | NA | 3rd Qu.: 175 | 3rd Qu.: 5 | 3rd Qu.: 24 | 3rd Qu.:2019-06-23 | 3rd Qu.: 2 | 3rd Qu.: 2 | 3rd Qu.:227 |
| Max | Max. :36487245 | NA | Max. :2.74e+08 | NA | NA | Hell’s Kitchen : 1958 | Max. :40.9 | Max. :-73.7 | NA | Max. :10000 | Max. :1250 | Max. :629 | Max. :2019-07-08 | Max. :58 | Max. :327 | Max. :365 |
| NA | NA | NA | NA | NA | NA | (Other) :32209 | NA | NA | NA | NA | NA | NA | NA’s :10052 | NA’s :10052 | NA | NA |
First, we look at the price distribution for the entire dataset:
library(ggplot2)
ggplot(data=airbnb, aes(price)) +
geom_histogram(bins=100,
col="dark blue",
fill="light blue",
alpha = .7) + # opacity
labs(x="Price", y="Frequency") +
labs(title="Histogram of AirBnB Price (All Observations)")
qqnorm(airbnb$price, pch = 20, main = "Q-Q Plot for AirBnB Prices (All Observations)")
qqline(airbnb$price, col = "black", lwd = 2)
Price is not normally distributed: it is strongly right-skewed (median $106, mean $153, maximum $10,000), with a large number of high-price outliers. If we remove those outliers, the distribution is closer to normal.
airbnb_clean = outlierKD2(airbnb, price, rm = TRUE, boxplt = TRUE, qqplt = TRUE)
## Outliers identified: 2972
## Propotion (%) of outliers: 6.5
## Mean of the outliers: 659
## Mean without removing outliers: 153
## Mean if we remove outliers: 120
## Outliers successfully removed
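`outlierKD2` is a helper defined outside this report, so its exact rule is not visible here. Assuming it follows the usual boxplot convention (values more than 1.5 × IQR beyond the quartiles count as outliers), the same kind of filter can be sketched in base R. The function name `iqr_outlier` is hypothetical and the prices below are made up for illustration:

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is the boxplot convention.
iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

prices  <- c(60, 79, 80, 89, 149, 150, 200, 225, 10000)  # illustrative values only
flagged <- iqr_outlier(prices)

prices[flagged]          # values treated as outliers
mean(prices)             # mean without removing outliers
mean(prices[!flagged])   # mean if we remove outliers
```

A single extreme listing is enough to pull the mean well above the bulk of the data, which is the pattern the output above reports for the full dataset.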
library(ggplot2)
ggplot(data = airbnb_clean, aes(price)) +
geom_histogram(bins = 100,
col = "dark blue",
fill = "light blue",
alpha = .7) + # opacity
labs(title = "Histogram of AirBnB Price (Outliers removed)",
x = "AirBnB Price",
y = "Frequency") +
theme_grey()
qqnorm(airbnb_clean$price, pch = 20, main = "Q-Q Plot for AirBnB Prices (Outliers Removed)")
qqline(airbnb_clean$price, col = "black", lwd = 2)
Scatter plot for price and number of reviews without any transformations:
library(ggplot2)
library(ggpubr)
ggplot(airbnb, aes(x=price, y=number_of_reviews)) +
ggtitle("Number of Reviews vs Price Scatter Plot") +
xlab("Price ($)") + ylab("Number Of Reviews") +
geom_point(size = 1, shape = 18, color = "black") +
geom_smooth(method = lm, se = FALSE, color = "yellow", size = 1.2) + theme_bw() +
stat_cor(method = "pearson", label.x = 6500 )
There appears to be an exponential decline in the number of reviews as price increases.
Scatter plot for price and number of reviews, taking \(\log(\text{reviews})\) to show a linear trend:
library(ggplot2)
library(ggpubr)
ggplot(airbnb, aes(x=price, y=log(number_of_reviews))) +
ggtitle("log(Number of Reviews) vs Price Scatter Plot") +
xlab("Price ($)") + ylab("log(Number Of Reviews)") +
geom_point(size = 1, shape = 18, color = "black") +
geom_smooth(method = lm, se = FALSE, color = "yellow", size = 1.2) + theme_bw() +
stat_cor(method = "pearson", label.x = 6500 )
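The reason for taking logs: if reviews fall off exponentially with price, reviews ≈ a·exp(−b·price), then log(reviews) = log(a) − b·price, which is a straight line in price. The sketch below uses synthetic data (the constants `a` and `b` are arbitrary, not estimates from the Airbnb data) to show a linear fit on the log scale recovering both parameters. It also illustrates one caveat: listings with zero reviews give log(0) = -Inf, so they have to be dropped or handled separately before fitting.

```r
# Synthetic exponential decline: reviews = a * exp(-b * price)
a <- 50
b <- 0.004
price   <- seq(50, 1000, by = 50)
reviews <- a * exp(-b * price)

# On the log scale the relationship is linear: log(reviews) = log(a) - b * price
fit <- lm(log(reviews) ~ price)
coef(fit)   # intercept estimates log(a), slope estimates -b

# Zero-review listings give log(0) = -Inf, so keep only positive counts when fitting
counts <- c(0, 9, 45, 270)
log(counts[counts > 0])
```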
Note: outliers extend past $1,000 per night; the graph is truncated for visibility and interpretation.
library(ggplot2)
ggplot(airbnb, aes(price, factor(neighbourhood_group))) +
geom_boxplot(color = "black", fill = c("light green", "pink","light blue", "yellow", "orange")) +
labs(title = "Neighbourhood Group vs Price Box plot", x = "Price", y = "Neighbourhood group") +
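coord_cartesian(xlim = c(0, 1000)) # zoom only; xlim() would drop listings above $1,000 before the box statistics are computed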
Note: outliers extend past $1,000 per night; the graph is truncated for visibility and interpretation.
library(ggplot2)
ggplot(airbnb, aes(price, factor(room_type))) +
geom_boxplot(width = 0.7, color = "black", fill = c("light green", "yellow","light blue")) +
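labs(title = "Room type vs Price Box plot", x = "Price", y = "Room Type") +
coord_cartesian(xlim = c(0, 1000)) # zoom only; xlim() would drop listings above $1,000 before the box statistics are computed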
#install.packages("plotly")
library(plotly)
fig <- airbnb
fig <- fig %>%
plot_ly(
lat = ~latitude,
lon = ~longitude,
color = ~neighbourhood_group,
colors = "Set1",
type = 'scattermapbox')
fig <- fig %>%
layout(
mapbox = list(
style = 'open-street-map',
zoom =9,
center = list(lon = -73.97, lat = 40.71)))
fig
# create subset just for aggregating by mean
airbnb_map <- airbnb[ , c(6, 7, 8, 10)]
airbnb_map_means <- aggregate(.~neighbourhood, airbnb_map, mean)
# create subset for aggregating by count
airbnb_count <- airbnb_map
airbnb_count$count <- 1
airbnb_count <- airbnb_count[, c(1,5)]
airbnb_counter <- aggregate(.~neighbourhood, airbnb_count, sum)
# create full dataset from both subsets
airbnb_map_full <- cbind(airbnb_counter, airbnb_map_means)
# check that the union occurred correctly, then drop the extra neighbourhood column
all.equal(airbnb_map_full[, 1], airbnb_map_full[, 3]) # true!
airbnb_map_full <- airbnb_map_full[, -3]
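The `cbind()` above relies on both aggregates coming back in the same neighbourhood order, which is why the `all.equal()` check is needed. A possible alternative, sketched here on a four-row toy data frame (the `listings` data and the `by_hood` name are illustrative, not part of the original analysis), is to join the two summaries on the key with `merge()` so that row order no longer matters:

```r
# Toy listings: two in Harlem, one each in Midtown and Kensington.
listings <- data.frame(
  neighbourhood = c("Harlem", "Midtown", "Harlem", "Kensington"),
  latitude      = c(40.81, 40.75, 40.80, 40.65),
  longitude     = c(-73.94, -73.98, -73.95, -73.97),
  price         = c(150, 225, 100, 149)
)

# Mean latitude, longitude and price per neighbourhood.
means <- aggregate(. ~ neighbourhood, listings, mean)

# Number of listings per neighbourhood.
counts <- aggregate(price ~ neighbourhood, listings, length)
names(counts)[2] <- "count"

# Join on the neighbourhood key instead of binding columns side by side.
by_hood <- merge(counts, means, by = "neighbourhood")
by_hood
```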
#Load the library
library(ggplot2)
library(ggmap)
#Set your API Key
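ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY")) # read the key from an environment variable (the variable name is an assumption) rather than hard-coding it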
#map by price
newyork.map <- get_map("New York", zoom = 10, scale = 1, maptype = "terrain")
ggmap(newyork.map) + geom_point(data = airbnb_map_full, aes(x = longitude, y = latitude, colour = price, size = count), alpha = 0.5) +
scale_colour_gradientn(colours=rainbow(3)) +
labs(title = "Average AirBnb Price by Neighborhood")
loadPkg("faraway")
loadPkg("corrplot")
xkabledply(cor(airbnb_cor))
| | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|
| price | 1.0000 | 0.0428 | -0.0480 | NA | 0.0575 | 0.0818 |
| minimum_nights | 0.0428 | 1.0000 | -0.0801 | NA | 0.1280 | 0.1443 |
| number_of_reviews | -0.0480 | -0.0801 | 1.0000 | NA | -0.0724 | 0.1720 |
| reviews_per_month | NA | NA | NA | 1 | NA | NA |
| calculated_host_listings_count | 0.0575 | 0.1280 | -0.0724 | NA | 1.0000 | 0.2257 |
| availability_365 | 0.0818 | 0.1443 | 0.1720 | NA | 0.2257 | 1.0000 |
airbnb_corplot = cor(airbnb_cor, use = "complete.obs")
corrplot(airbnb_corplot, method = "circle")
No variable is strongly correlated with price (all |r| < 0.1), but minimum_nights and availability_365 (r ≈ 0.14), number_of_reviews and availability_365 (r ≈ 0.17), and calculated_host_listings_count and availability_365 (r ≈ 0.23) show weak positive correlations.
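The `NA` entries in the correlation table come from the default `use = "everything"` in `cor()`, which returns `NA` for any pair involving a column with missing values (here `reviews_per_month`). A small sketch with made-up numbers (the `demo` data frame is illustrative only) shows how the `use` argument changes the result:

```r
# Toy data: one column has a missing value, like reviews_per_month above.
demo <- data.frame(
  price             = c(149, 225, 150, 89, 80),
  number_of_reviews = c(9, 45, 0, 270, 9),
  reviews_per_month = c(0.21, 0.38, NA, 4.64, 0.10)
)

cor(demo)                                 # NA wherever reviews_per_month is paired with another column
cor(demo, use = "complete.obs")           # drops row 3 for every pair
cor(demo, use = "pairwise.complete.obs")  # drops row 3 only for pairs that involve the NA
```

With `"complete.obs"`, every coefficient is computed on the same reduced set of rows; with `"pairwise.complete.obs"`, pairs that have no missing values still use all rows, so the two options can give different numbers for the same pair.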
cor.test(x=airbnb_cor$reviews_per_month, y=airbnb_cor$number_of_reviews)
##
## Pearson's product-moment correlation
##
## data: airbnb_cor$reviews_per_month and airbnb_cor$number_of_reviews
## t = 130, df = 38841, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.543 0.557
## sample estimates:
## cor
## 0.55
As expected, the two variables are correlated (r = 0.55), since reviews per month is derived from the total number of reviews, so we do not need to examine both.
cor.test(y=airbnb$number_of_reviews, x=airbnb$price)
##
## Pearson's product-moment correlation
##
## data: airbnb$price and airbnb$number_of_reviews
## t = -11, df = 48893, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0568 -0.0391
## sample estimates:
## cor
## -0.048
There is no evidence of a strong linear correlation (r = -0.048), but there is evidence of a statistically significant inverse relationship between price and reviews: higher-priced listings tend to have fewer reviews, possibly because they have fewer stays, for which the number of reviews is probably a good proxy.
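A very small p-value alongside a correlation of only -0.048 reflects the sample size. The test statistic for a Pearson correlation is t = r·sqrt(df) / sqrt(1 − r²), so with tens of thousands of observations even a weak correlation produces a large |t|. As a rough check, the sketch below recomputes t from the rounded estimates printed above (`cor_t` is a hypothetical helper name):

```r
# t statistic for a Pearson correlation coefficient r with df = n - 2
cor_t <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

cor_t(0.55, 38841)     # reviews_per_month vs number_of_reviews
cor_t(-0.048, 48893)   # price vs number_of_reviews
```

Both values round to the t statistics reported by `cor.test()` above (130 and -11).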
#anova test for price and neighborhood groups
anova_price_group = aov(price ~ neighbourhood_group, data=airbnb)
summary(anova_price_group) -> sum_anova_price_group
xkabledply(sum_anova_price_group, title = "ANOVA result summary for Neighborhood Groups")
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| neighbourhood_group | 4 | 7.96e+07 | 19897739 | 355 | 0 |
| Residuals | 48890 | 2.74e+09 | 56051 | NA | NA |
tukeyAoV_pg <- TukeyHSD(anova_price_group)
tukeyAoV_pg
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ neighbourhood_group, data = airbnb)
##
## $neighbourhood_group
## diff lwr upr p adj
## Brooklyn-Bronx 36.89 16.81 57.0 0.000
## Manhattan-Bronx 109.38 89.34 129.4 0.000
## Queens-Bronx 12.02 -9.33 33.4 0.539
## Staten Island-Bronx 27.32 -11.42 66.1 0.305
## Manhattan-Brooklyn 72.49 66.17 78.8 0.000
## Queens-Brooklyn -24.87 -34.58 -15.2 0.000
## Staten Island-Brooklyn -9.57 -43.32 24.2 0.938
## Queens-Manhattan -97.36 -106.99 -87.7 0.000
## Staten Island-Manhattan -82.06 -115.79 -48.3 0.000
## Staten Island-Queens 15.29 -19.23 49.8 0.746
#anova test for price and room type
anova_price_room = aov(price ~ room_type, data=airbnb)
summary(anova_price_room) -> sum_anova_price_room
xkabledply(sum_anova_price_room, title = "ANOVA result summary for Room Type")
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| room_type | 2 | 1.85e+08 | 92512441 | 1717 | 0 |
| Residuals | 48892 | 2.63e+09 | 53892 | NA | NA |
tukeyAoV_pr <- TukeyHSD(anova_price_room)
tukeyAoV_pr
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ room_type, data = airbnb)
##
## $room_type
## diff lwr upr p adj
## Private room-Entire home/apt -122.0 -127 -117.02 0.000
## Shared room-Entire home/apt -141.7 -158 -125.33 0.000
## Shared room-Private room -19.7 -36 -3.27 0.014
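As a check on the two ANOVA tables above, the F value in each is the ratio of the group mean square to the residual mean square, and the p-value is the upper tail of the F distribution with the listed degrees of freedom. A short sketch using the figures printed in the tables (variable names are illustrative):

```r
# F = Mean Sq (factor) / Mean Sq (residuals), using the values from the tables
f_group <- 19897739 / 56051   # neighbourhood_group, Df = 4 and 48890
f_room  <- 92512441 / 53892   # room_type, Df = 2 and 48892

round(f_group)   # matches the F value reported for neighbourhood_group
round(f_room)    # matches the F value reported for room_type

# Upper-tail probabilities; both are far below any conventional threshold,
# which is why the tables display Pr(>F) as 0
pf(f_group, df1 = 4, df2 = 48890, lower.tail = FALSE)
pf(f_room,  df1 = 2, df2 = 48892, lower.tail = FALSE)
```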